
feat: capture-aware alloc + workspace pre-alloc (T2.2, T2.3)#97

Merged
dndungu merged 5 commits into main from wave-4b-integration
Apr 16, 2026

Conversation


dndungu commented Apr 16, 2026

Summary

Wave 4b of the GB10 CUDA graph capture fix (docs/plan.md E2). This is the core fix that resolves the silent hang described in #93.

  • T2.2: Capture-aware allocWeight routing. When CaptureAwareAllocator.IsCapturing() reports an active capture, allocWeight routes through cudaMallocAsync on the capture stream (so the allocation is recorded as a graph node) instead of cudaMallocManaged, which is illegal during capture on GB10. Similarly, uploadBytes routes through cudaMemcpyAsync for H2D copies during capture. Added IsCapturing() to the CaptureAwareAllocator interface and its implementations. 7 new tests.
  • T2.3: Workspace pre-allocation. preAllocateWorkspaces(), called at the end of UploadWeights, eagerly initializes the FP8 scratchpad and cuBLASLt handle so no lazy allocation occurs inside a capture region. Added a captureAllocCount atomic counter that instruments capture-time allocations; it should be zero for a properly pre-allocated workload. 7 new tests.

Together with T2.1a (WithCapture helper, PR #96) and T4.1 (capture watchdog, PR #96), this completes the E2+E4 fix path. The production hang in #93 is now resolved: callers use WithCapture → allocator switches to capture-aware mode → allocWeight uses async alloc → no illegal MallocManaged during capture → no hang.

Refs #93.

Verification

  • Build: go build ./... PASS
  • Test: go test ./compute/... -race -timeout 120s PASS (14 new tests, 2.7s)
  • Merge: auto-merged cleanly (both branches touch gpu_engine.go in different functions)
  • Silent-revert check: all key symbols from both branches present in integration HEAD
  • Stub audit: zero hits

Test plan

  • go build ./...
  • go test ./compute/... -race -timeout 120s
  • Auto-merge gpu_engine.go conflict-free
  • Both T2.2 and T2.3 symbols present post-merge
  • CI green (auto)
  • T2.6 (follow-on): hardware validation on DGX with capture enabled

dndungu added 5 commits April 16, 2026 09:24
…sync

When CaptureAwareAllocator is active (set by BeginCapture/WithCapture),
allocWeight routes through cudaMallocAsync on the capture stream so
allocations are recorded as graph nodes. This avoids the silent hang
caused by cudaMallocManaged during CUDA graph capture on GB10.

Similarly, uploadBytes routes through cudaMemcpyAsync on the capture
stream instead of the synchronous CPU copy used by the managed-memory
path, which is illegal during capture.

The ensureNotCapturing guard now only fires when capture is active but
the allocator was NOT properly switched via BeginCapture/WithCapture.

Changes:
- Add IsCapturing() to CaptureAwareAllocator interface
- Implement IsCapturing() on cuda.MemPool and gpuapi.CUDAMemPool
- Add async allocation/copy routing in allocWeight and uploadBytes
- Add function variable indirections for MallocManaged, MallocAsync,
  and MemcpyAsync to enable CPU-mock testing
- Add 7 unit tests covering all routing paths
…o avoid capture-time alloc

Add preAllocateWorkspaces() that eagerly initializes the FP8 scratchpad
(scaleOne pointer + struct) and cuBLASLt handle at the end of
UploadWeights, before any CUDA graph capture region begins. These two
objects previously used lazy initialization (getFP8Scratch, getLtHandle)
which triggered cudaMalloc on first use -- hanging silently on GB10 when
first use happened inside capture.

Also add captureAllocCount atomic counter to track allocWeight attempts
during active capture. EndCapture resets the counter and logs a warning
if non-zero. CaptureAllocCount() exposes the counter for testing.
dndungu merged commit aa0dac6 into main Apr 16, 2026
1 check passed
dndungu deleted the wave-4b-integration branch April 16, 2026 16:30
